Panorama Optimizations#5041

Open
aknayar wants to merge 26 commits into facebookresearch:main from aknayar:optimize-pano

Conversation

Contributor

@aknayar aknayar commented Apr 3, 2026

Note: Should be merged before #4970 (IVFPQPanorama).

Changes

Performance

This PR implements various optimizations to Panorama (L2Flat and IVFFlat).

  1. Disaggregate distance computation from pruning decisions to avoid branches in the distance-computation hot path.
  2. Terminate batch processing early when no points remain.
  3. Manually unroll the distance kernel.
  4. Template the distance computation on level width for autovectorization.
  5. Use if constexpr (C::is_max) instead of C::cmp for autovectorized pruning.
  6. Use a byteset for vectorized compaction of active indices via _pext_u64.
  7. Template distance computation and pruning on the first level (no active_indices indirection) so they can autovectorize.
  8. Hoist buffer allocations into IndexFlat/IVFFlatScannerPanorama.
  9. Expose batch_size as a parameter for IVFFlatPanorama (for consistency with IndexFlatPanorama, and because a batch_size of 1024 can improve performance).
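Items 3-4 can be sketched as follows. This is a hedged simplification with invented names (level_l2_sqr, level_l2_sqr_dispatch), not the actual FAISS kernels: templating on the level width gives the compiler a compile-time trip count, so it can fully unroll and autovectorize the loop without a runtime bound.

```cpp
#include <cstddef>

// Distance kernel templated on the level width (items 3-4, simplified):
// LevelWidth is a compile-time constant, so the compiler can unroll and
// vectorize this loop with no runtime trip-count checks.
template <size_t LevelWidth>
float level_l2_sqr(const float* x, const float* y) {
    float sum = 0.0f;
    for (size_t i = 0; i < LevelWidth; i++) {
        float d = x[i] - y[i];
        sum += d * d;
    }
    return sum;
}

// A runtime-width entry point can dispatch common widths to the
// templated kernel, keeping the fast paths vectorized.
inline float level_l2_sqr_dispatch(const float* x, const float* y, size_t w) {
    switch (w) {
        case 8:
            return level_l2_sqr<8>(x, y);
        case 16:
            return level_l2_sqr<16>(x, y);
        default: {
            // Generic fallback with a runtime trip count.
            float sum = 0.0f;
            for (size_t i = 0; i < w; i++) {
                float d = x[i] - y[i];
                sum += d * d;
            }
            return sum;
        }
    }
}
```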

Other

  • Define kDefaultBatchSize once in Panorama.h (previously defined in 5 separate locations).
  • Allow bench_flat_l2_panorama.py and bench_ivf_flat_panorama.py to accept gist1M or sift1M as the dataset to benchmark on.

Results

Together, these optimizations deliver substantial additional speedups, especially on lower-dimensional datasets like SIFT (128d), by sharply reducing Panorama's overhead:

GIST1M (IVF128, nlist=128, nlevels=16)

| nprobe | Recall@10 | Old Speedup | New Speedup | Additional Speedup |
| ------ | --------- | ----------- | ----------- | ------------------ |
| 1      | 0.1439    | 3.92x       | 3.93x       | 1.00x              |
| 2      | 0.2605    | 4.71x       | 5.19x       | 1.10x              |
| 4      | 0.4369    | 5.53x       | 6.75x       | 1.22x              |
| 8      | 0.6470    | 6.37x       | 8.21x       | 1.29x              |
| 16     | 0.8780    | 7.30x       | 9.74x       | 1.33x              |
| 32     | 0.9764    | 8.33x       | 11.29x      | 1.36x              |
| 64     | 0.9868    | 9.38x       | 12.74x      | 1.36x              |

SIFT1M (IVF128, nlist=128, nlevels=8)

| nprobe | Recall@10 | Old Speedup | New Speedup | Additional Speedup |
| ------ | --------- | ----------- | ----------- | ------------------ |
| 1      | 0.2678    | 1.20x       | 1.81x       | 1.52x              |
| 2      | 0.4584    | 1.38x       | 2.23x       | 1.62x              |
| 4      | 0.6855    | 1.59x       | 2.70x       | 1.70x              |
| 8      | 0.8760    | 1.83x       | 3.44x       | 1.88x              |
| 16     | 0.9679    | 2.11x       | 4.72x       | 2.24x              |
| 32     | 0.9855    | 2.44x       | 5.61x       | 2.30x              |
| 64     | 0.9861    | 2.74x       | 6.39x       | 2.33x              |

Raw Data

Collected by running the new benches on main and on this branch. On main you cannot specify batch_size, so remove the {1024} from the factory string in the new benches to run them there. The results above are calculated from the following raw data as follows:

  1. For each experiment (e.g., GIST (old) or SIFT (new)), calculate the Panorama speedup for each nprobe as (original ms per query) / (pano ms per query).
  2. For each pairing of (old) and (new) results, calculate the additional speedup as (new speedup) / (old speedup).

Before (main)

GIST1M:

======IVF128,Flat
	nprobe   1, Recall@10: 0.145200, speed: 2.705442 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.260800, speed: 5.456891 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.441900, speed: 10.895120 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.648200, speed: 21.676788 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.878000, speed: 43.142261 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.975400, speed: 84.498397 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986800, speed: 160.092644 ms/query, dims scanned: 100.00%
======PCA960,IVF128,FlatPanorama16
	nprobe   1, Recall@10: 0.143900, speed: 0.689507 ms/query, dims scanned: 12.96%
	nprobe   2, Recall@10: 0.260500, speed: 1.158416 ms/query, dims scanned: 11.18%
	nprobe   4, Recall@10: 0.436900, speed: 1.968814 ms/query, dims scanned: 9.90%
	nprobe   8, Recall@10: 0.647000, speed: 3.401469 ms/query, dims scanned: 8.91%
	nprobe  16, Recall@10: 0.878000, speed: 5.912757 ms/query, dims scanned: 8.10%
	nprobe  32, Recall@10: 0.976400, speed: 10.147847 ms/query, dims scanned: 7.44%
	nprobe  64, Recall@10: 0.986800, speed: 17.074573 ms/query, dims scanned: 6.93%

SIFT1M:

======IVF128,Flat
	nprobe   1, Recall@10: 0.267480, speed: 0.285990 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.457520, speed: 0.564067 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.685320, speed: 1.111833 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.877210, speed: 2.195088 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.967730, speed: 4.338444 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.985400, speed: 8.500538 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986100, speed: 16.349893 ms/query, dims scanned: 100.00%
======PCA128,IVF128,FlatPanorama8
	nprobe   1, Recall@10: 0.267670, speed: 0.239243 ms/query, dims scanned: 27.97%
	nprobe   2, Recall@10: 0.458320, speed: 0.408590 ms/query, dims scanned: 24.42%
	nprobe   4, Recall@10: 0.685480, speed: 0.699694 ms/query, dims scanned: 21.50%
	nprobe   8, Recall@10: 0.875930, speed: 1.197310 ms/query, dims scanned: 19.06%
	nprobe  16, Recall@10: 0.967760, speed: 2.055968 ms/query, dims scanned: 16.98%
	nprobe  32, Recall@10: 0.985370, speed: 3.481555 ms/query, dims scanned: 15.26%
	nprobe  64, Recall@10: 0.985980, speed: 5.977346 ms/query, dims scanned: 14.02%

After (optimize-pano)

GIST1M:

======IVF128,Flat
	nprobe   1, Recall@10: 0.145200, speed: 2.625779 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.260800, speed: 5.285007 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.441900, speed: 10.555867 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.648200, speed: 21.012494 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.878000, speed: 41.794143 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.975400, speed: 81.865038 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986800, speed: 155.067333 ms/query, dims scanned: 100.00%
======PCA960,IVF128,FlatPanorama16_1024
	nprobe   1, Recall@10: 0.143900, speed: 0.668800 ms/query, dims scanned: 20.33%
	nprobe   2, Recall@10: 0.260500, speed: 1.018440 ms/query, dims scanned: 14.81%
	nprobe   4, Recall@10: 0.436900, speed: 1.563622 ms/query, dims scanned: 11.72%
	nprobe   8, Recall@10: 0.647000, speed: 2.557981 ms/query, dims scanned: 9.82%
	nprobe  16, Recall@10: 0.878000, speed: 4.292616 ms/query, dims scanned: 8.56%
	nprobe  32, Recall@10: 0.976400, speed: 7.248832 ms/query, dims scanned: 7.68%
	nprobe  64, Recall@10: 0.986800, speed: 12.171319 ms/query, dims scanned: 7.06%

SIFT1M:

======IVF128,Flat
        nprobe   1, Recall@10: 0.267480, speed: 0.295904 ms/query, dims scanned: 100.00%
        nprobe   2, Recall@10: 0.457520, speed: 0.583204 ms/query, dims scanned: 100.00%
        nprobe   4, Recall@10: 0.685320, speed: 1.150055 ms/query, dims scanned: 100.00%
        nprobe   8, Recall@10: 0.877210, speed: 2.425575 ms/query, dims scanned: 100.00%
        nprobe  16, Recall@10: 0.967730, speed: 5.509365 ms/query, dims scanned: 100.00%
        nprobe  32, Recall@10: 0.985400, speed: 10.794491 ms/query, dims scanned: 100.00%
        nprobe  64, Recall@10: 0.986100, speed: 20.727924 ms/query, dims scanned: 100.00%
======PCA128,IVF128,FlatPanorama8_1024
        nprobe   1, Recall@10: 0.267750, speed: 0.163266 ms/query, dims scanned: 34.97%
        nprobe   2, Recall@10: 0.458370, speed: 0.261109 ms/query, dims scanned: 27.97%
        nprobe   4, Recall@10: 0.685540, speed: 0.425977 ms/query, dims scanned: 23.30%
        nprobe   8, Recall@10: 0.875990, speed: 0.704580 ms/query, dims scanned: 19.98%
        nprobe  16, Recall@10: 0.967860, speed: 1.167465 ms/query, dims scanned: 17.45%
        nprobe  32, Recall@10: 0.985470, speed: 1.925296 ms/query, dims scanned: 15.50%
        nprobe  64, Recall@10: 0.986080, speed: 3.245793 ms/query, dims scanned: 14.14%

@meta-cla meta-cla bot added the CLA Signed label Apr 3, 2026
@aknayar aknayar marked this pull request as draft April 3, 2026 22:43
}

float lower_bound = exact_distances[idx] - cauchy_schwarz_bound;
if constexpr (C::is_max) {
Contributor Author

@aknayar aknayar Apr 4, 2026


Unfortunately C::cmp() kills autovectorization here so we resort to this workaround.
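The contrast can be illustrated with a minimal sketch (hypothetical simplification of the loop above; the names CMaxLike and prune are invented). With if constexpr the comparison resolves at compile time to a simple vectorizable compare, whereas an opaque comparator call like C::cmp() is a function the autovectorizer cannot see through.

```cpp
#include <cstddef>

// Comparator in the style of faiss::CMax: is_max selects the comparison
// direction as a compile-time constant.
struct CMaxLike {
    static constexpr bool is_max = true; // keep smallest distances
};

// Branch-free pruning pass: compute each lower bound, decide keep/drop
// with an `if constexpr`-selected compare, and compact survivors without
// a data-dependent branch.
template <typename C>
size_t prune(
        const float* exact_distances,
        const float* bounds,
        float threshold,
        size_t n,
        size_t* out) {
    size_t n_kept = 0;
    for (size_t idx = 0; idx < n; idx++) {
        float lower_bound = exact_distances[idx] - bounds[idx];
        bool keep;
        if constexpr (C::is_max) {
            keep = lower_bound < threshold;
        } else {
            keep = lower_bound > threshold;
        }
        out[n_kept] = idx; // unconditional store, advanced only if kept
        n_kept += keep;
    }
    return n_kept;
}
```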

write_ivf_header(ivfp, f);
WRITE1(ivfp->n_levels);
WRITE1(ivfp->batch_size);
if (ivfp->batch_size == Panorama::kDefaultBatchSize) {
Contributor Author


For backward compatibility.

@aknayar aknayar marked this pull request as ready for review April 4, 2026 19:34
* accelerating the refinement stage.
*/
struct Panorama {
static constexpr size_t kDefaultBatchSize = 128;
Contributor Author


I'm considering defining kLegacyDefaultBatchSize = 128 and kDefaultBatchSize = 1024 to update the default and have a fallback for the old indexes which were created with 128. Is such a change in default behavior allowed (IVF128,FlatPanorama8 would then silently use 1024 batch_size instead of 128)?

}

template <typename Lambda>
inline auto with_bool(bool value, Lambda&& fn) {
Contributor Author


I'm curious if there's a more appropriate location to define this.
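A common shape for such a helper, shown as a hedged sketch (the definition in this PR may differ): the caller passes a generic lambda, and the helper invokes it with std::true_type or std::false_type so the lambda body sees the flag as a compile-time constant.

```cpp
#include <type_traits>
#include <utility>

// Dispatch a runtime bool to a compile-time constant: the generic
// lambda is instantiated twice, once with std::true_type and once with
// std::false_type, so inside `fn` the value is usable in constexpr
// contexts (e.g. as a template argument).
template <typename Lambda>
inline auto with_bool(bool value, Lambda&& fn) {
    if (value) {
        return std::forward<Lambda>(fn)(std::true_type{});
    } else {
        return std::forward<Lambda>(fn)(std::false_type{});
    }
}
```

A typical call site would look like `with_bool(is_first_level, [&](auto first) { scan_batch<first.value>(...); });`, turning one runtime flag into two specialized instantiations.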

# All modern CPUs support F, CD, VL, DQ, BW extensions.
# Ref: https://en.wikipedia.org/wiki/AVX512
-target_compile_options(faiss_avx512 PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-mavx2 -mfma -mf16c -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mpopcnt>)
+target_compile_options(faiss_avx512 PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-mavx2 -mfma -mf16c -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mpopcnt ${FAISS_BMI2_FLAGS}>)
Contributor Author


Will have to add this to avx512_spr as well once #5034 goes in.
